HipKittens MXFP8 GEMM Support by alextmagro · Pull Request #566 · ROCm/TransformerEngine

alextmagro · 2026-04-28T05:16:00Z

Creates an MXFP8 GEMM with HipKittens that outperforms hipBLASLt and offers additional epilogues such as BIAS and GELU AUX.

Requires a workspace sized relative to the model. This is often larger than the hipBLASLt workspace, but comes with significant performance improvements. Only builds for gfx950, and requires M and N to be multiples of 256.

Adds the HipKittens header library as a submodule.
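The constraints in the description can be read as a simple dispatch predicate. The following is a minimal sketch only: the function name `hipkittens_mxfp8_eligible`, the constants, and the reading of "M / 256 and N / 256" as "M and N are multiples of 256" are assumptions for illustration, not code from this PR.

```python
# Hypothetical helper illustrating the constraints stated in the PR description.
# None of these names come from TransformerEngine itself.
HIPKITTENS_ARCHS = {"gfx950"}  # the description says the kernel only builds for gfx950
TILE = 256                     # assumed meaning of "M / 256 and N / 256"


def hipkittens_mxfp8_eligible(m: int, n: int, arch: str) -> bool:
    """Return True if a GEMM of output shape (m, n) on `arch` meets the stated constraints."""
    if arch not in HIPKITTENS_ARCHS:
        return False
    # Both output dimensions must be positive multiples of the tile size.
    return m > 0 and n > 0 and m % TILE == 0 and n % TILE == 0
```

A caller would presumably fall back to the hipBLASLt path whenever this kind of check fails.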

ipanfilo · 2026-05-08T16:46:34Z

                         [](const testing::TestParamInfo<DqGEMMTestSuite::ParamType>& info) {
-                           return MKN(std::get<0>(info.param)) + "x" + TN(std::get<3>(info.param));
+                           return MKN(std::get<0>(info.param)) + "x" +
+                                  std::to_string(std::get<1>(info.param)) + "x" +


What is the point of these? They are only ever set to false.

ipanfilo · 2026-05-08T17:15:28Z


-    return torch.empty(get_cublas_workspace_size_bytes(), dtype=torch.uint8, device=device)
+    key = (device, ub, grouped_gemm)
+    ws = _workspace_cache.get(key)


Why don't we rely on torch memory caching?

I have made this change. I will need to do an E2E run to make sure that performance isn't affected, but it should be OK given my understanding of torch.empty().

It doesn't seem to have changed.
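For context, the thread above weighs an explicit workspace cache against allocating on every call and letting the framework's caching allocator reuse memory. Below is a minimal sketch of the explicit-cache pattern, using the key `(device, ub, grouped_gemm)` from the diff. The function name `get_workspace`, the `size_bytes` and `alloc` parameters, and the grow-on-demand check are assumptions for illustration; `bytearray` stands in for the `torch.empty(..., dtype=torch.uint8, device=device)` call so the example runs without a GPU.

```python
from typing import Callable, Dict, Hashable, Tuple

# Hypothetical module-level cache; the key layout mirrors the diff above.
_workspace_cache: Dict[Tuple[Hashable, bool, bool], bytearray] = {}


def get_workspace(
    device: Hashable,
    ub: bool,
    grouped_gemm: bool,
    size_bytes: int,
    alloc: Callable[[int], bytearray] = bytearray,  # stand-in for torch.empty
) -> bytearray:
    """Return a cached workspace for this key, allocating only when needed."""
    key = (device, ub, grouped_gemm)
    ws = _workspace_cache.get(key)
    # Assumption: reallocate if nothing is cached or the cached buffer is too small.
    if ws is None or len(ws) < size_bytes:
        ws = alloc(size_bytes)
        _workspace_cache[key] = ws
    return ws
```

With this pattern, repeated calls with the same key return the same buffer object. The alternative raised in the review is to drop the dictionary and allocate on each call.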

ipanfilo · 2026-05-14T23:09:35Z

+  if (use_hipkittens) {
+    auto param = CanonicalizeGemmInput(*inputA, transa, *inputB, transb, m, n, k);
+
+    hipStream_t s = use_service_stream ? ss_ctl.stream : stream;


Same as with is_mxfp8: there is no point in defining it for only one branch.

ipanfilo · 2026-05-15T00:21:26Z

@@ -743,12 +786,15 @@ MAKE_DQ_GEMM_TEST(Testfp8xfp8xfp16, fp8, fp8, fp16)

 INSTANTIATE_TEST_SUITE_P(OperatorTest, DqGEMMTestSuite,


If you end up with a separate prefix for MXFP8, it has to be used for this suite as well, for consistency.

ipanfilo

Some comments in rocm_gemm and test_cublas_gemm are still open

HipKittens MXFP8 GEMM Support

f9d5ce2

alextmagro requested review from aris134, matthiasdiener and zstreet87 April 28, 2026 05:16

alextmagro requested review from ipanfilo, wangye805 and wenchenvincent as code owners April 28, 2026 05:16

alextmagro added the ci-level 1 (CI test level 1) label Apr 28, 2026

wangye805 requested changes May 1, 2026

alextmagro added 3 commits May 5, 2026 15:05

Update HipKittens branch after upstream MXFP8 merge

aac5860

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

c917ed0

Update HipKittens commit and address PR comments

3a91321

alextmagro requested a review from wangye805 May 5, 2026 20:26

alextmagro added 5 commits May 5, 2026 20:26

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8 with conflicts

cc719fe

Resolve conflicts, ensure fp4 workspace changes are harmonious

fcda154

min workspace size guaranteed

70fba6d

add hipkittens to wheels

455002e

fix issue with gfx942 for unified build

ba60ef5

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

ipanfilo requested changes May 8, 2026

alextmagro added 2 commits May 12, 2026 02:59

Cleanup and workspace changes

f72b7b8

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

731640a

alextmagro requested review from aris134 and ipanfilo May 12, 2026 13:24

alextmagro added 3 commits May 12, 2026 16:56

fix jax import issue

1960c06

Fix autotuning bug

320152e

fix pytorch import

a280cf7

alextmagro requested a review from ipanfilo May 14, 2026 17:18

alextmagro added the ci-level 3 (CI test level 3) label and removed the ci-level 1 (CI test level 1) label May 14, 2026

matthiasdiener reviewed May 14, 2026

Comment thread transformer_engine/jax/cpp_extensions/gemm.py

Fix whitespaces and comment issues

f66f77c

ipanfilo reviewed May 15, 2026

Comment thread transformer_engine/pytorch/cpp_extensions/gemm.py Outdated

alextmagro added 5 commits May 18, 2026 17:52

Kernel optimizations

0b6e702

Add use_hipkittens_mxfp8 bool to test_cublaslt_gemm.cu

816c752

rocm_gemm.cu cleanup

aaa88d7

Add env check to jax file

e2203c0

Simplify Workspace Check

7648594

alextmagro requested a review from ipanfilo May 18, 2026 20:43

alextmagro added 5 commits May 18, 2026 22:49

Revert kernel optimizations

03f675b

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

f852c22

Readd dropped test code

3b307bb

Skip unsupported MXFP8 FSDP tests

33a5c45

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

cfa5fac

ipanfilo reviewed May 29, 2026

Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated

Fix inverted workspace logic in tests

44ea357

alextmagro requested a review from ipanfilo May 29, 2026 19:34

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

d76a8d2

ipanfilo reviewed Jun 4, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

alextmagro added 2 commits June 17, 2026 20:36

HK commit update and kernel optimization

495e271

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

14fb3e1

alextmagro requested a review from ipanfilo June 17, 2026 21:09

ipanfilo reviewed Jun 18, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

Comment thread transformer_engine/pytorch/cpp_extensions/gemm.py

Comment thread transformer_engine/jax/cpp_extensions/gemm.py

Add comments and caching

01b5203

alextmagro requested a review from ipanfilo June 18, 2026 22:47

ipanfilo reviewed Jun 19, 2026

alextmagro commented Apr 28, 2026 • edited

ipanfilo May 8, 2026

ipanfilo May 8, 2026

alextmagro May 12, 2026

ipanfilo Jun 19, 2026

ipanfilo May 14, 2026

ipanfilo May 15, 2026

alextmagro May 18, 2026

ipanfilo left a comment

5 participants
